NEW TEACHER EVALUATION SYSTEMS have emerged as the cornerstone
of the recent movement to improve public school teaching. Fueled by incentives
from the federal government, state and local policymakers have sought to replace the 
often-cursory evaluation models of the past with more comprehensive ones. In con- 
trast to past evaluations, which often relied on a single classroom visit by an untrained 
administrator, new models evaluate teachers on the basis of their students’ achieve- 
ment, on surveys that capture students’ perceptions of their teachers’ practice, and on 
improved classroom observations. 


The inclusion of student-performance 
measures has been highly controversial. 
But research conducted by the Brookings 
Institution in four urban districts found 
that only 22 percent of teachers had 
test-score gains factored into their evalua- 
tions. 1 In these districts and elsewhere, 
observations of teachers’ work in their 
classrooms continue to generate the 
majority of the performance informa- 
tion under the new evaluation systems. 
Policymakers have sought to improve 
upon traditional observation systems by 
tying observation to rigorous rubrics,
intensifying observer training, and
increasing the number of required obser- 
vations. Between 2011 and 2013 alone, 
the number of states requiring teachers to 
undergo multiple observations each year 
increased from under 10 to 25. 2 

But as these new systems roll out, there is 
mounting evidence that principals alone 
cannot bear the time burden they impose. 
Nor can a single principal be depended 
upon to deliver effective feedback across 
content areas to teachers with vastly dif- 
ferent strengths, weaknesses, and teaching 
assignments. In response to these challenges, a growing
number of districts have adopted multi-rater
systems, in which several observers watch teachers at 
work, score their performance, and provide feedback. 
Sometimes the raters observe together, sometimes in- 
dependently. And more and more, they come to the 
process from different vantage points: Many districts
now rely on combinations of peer teachers, master
teachers, and administrators from different schools. 
By adding more eyes to these evaluations, districts 
aim not only to relieve principals but, more impor- 
tant, to lend new perspectives, deeper expertise, and 
greater objectivity to the evaluation process. 

This report explores the use of multi-rater evaluation 
systems in 16 districts with widely varying student 
populations, resources, and policy priorities. The dis- 
tricts range from New York City, the nation’s largest 
school system, to Transylvania County, NC, which 
educates just 3,500 students each year. Drawing on 
document reviews and interviews with district offi- 
cials, it examines the districts’ varying aspirations for 
multi-rater models, as well as how the models are 
designed, how they operate, and the challenges they 
pose. This report is not intended to be a technical
assessment of these systems, nor does it take into
account processes related to non-scored observation,
such as observation by coaches who provide
only formative feedback. Although this brief does
not explore the training of raters in depth, training 
is an increasingly important topic with significant 
implications for the quality and equity of evalua- 
tion processes both within and across districts. Rater 
training will be the topic of a future Carnegie brief. 

The Problems with Observing Teaching 

It is increasingly clear from research and practice that 
relying solely on principals to conduct classroom 
observations is problematic. Time demands, for one 
thing, present a substantial burden to principals. 
After the first year of district-wide implementation 
of Chicago’s new evaluation system, for instance, 66 
percent of administrators said that the system’s in- 
creased observation requirement took too big a chunk 
of their already overscheduled time. 3 Administrators 
in Tennessee felt similarly after the first year of that 
state’s evaluation system. After conducting nearly 
300,000 classroom observations, school administra- 
tors declared the time burden “unmanageable.” 4,5 
Says Vince Botta, a former principal who is now the 
director of performance management in Georgia’s 
Gwinnett County Public Schools: “There is no way 
as a principal you can do it all alone.” 6 

Principals who do try to “do it all,” research sug- 
gests, may end up hurting the process. A 2013 study 
from the Consortium on Chicago School Research 
observed that the heavier workloads caused by the 
demand for teacher evaluation have “contributed to 
lower [than required] engagement in the new system 
for some principals.” 7 More specifically, researchers 
from Stanford University find that some overburdened
administrators “cut corners... typically doing
fewer [evaluations] than desired or even required.” 8 
While this behavior is understandable given the 
responsibilities administrators juggle, it could sig- 
nificantly undermine the improvement aims of 
evaluation by cutting the time a principal needs to 
conduct high-quality observations and to prepare 
and deliver targeted, actionable feedback. 


Quick, cursory observations are also likely to dam- 
age teachers’ trust in the evaluation system, further 
undermining efforts to improve their instruction. In 
a study of six urban schools, Harvard University re- 
searcher Stefanie Reinhorn found that “teachers who 
believed that administrators did not spend enough 
time in their classrooms questioned the validity of 
their evaluators’ assessment, doubted that the ap- 
praisal reflected an understanding of their daily work, 
and complained that the evaluation process lacked 
credibility.” 9 And even if evaluators do spend enough 
time in teachers’ classrooms, Reinhorn found, teach- 
ers felt strongly that “their evaluators need to have 
[content knowledge and pedagogical knowledge] in 
order to validly assess their practice, provide valuable 
feedback, and support individual teachers’ growth.” 


In other words, they needed expertise that principals 
can’t be expected to have across a full range of grade 
levels and subjects. 10 

Although there were some skilled evaluators in 
Reinhorn’s study, additional research suggests that 
these evaluators have been the exceptions rather 
than the norm. According to a research synthesis 
done by Heather Hill of Harvard University and 
Pamela Grossman of Stanford University (now of 
the University of Pennsylvania), “many principals 
lack the knowledge and expertise to provide con- 
tent-specific feedback,” especially in math. 11 

A Potential Solution: Multiple Raters 

Multi-rater systems show promise in addressing 
these challenges. For one thing, they are designed 
to give teachers more time with more evaluators. If 
calibrated carefully, says Reinhorn, multi-rater sys- 
tems can “address the shortage of time or specialized 
expertise” that leads to teachers’ concerns. 12

A growing body of research suggests that multi-rater 
designs might improve the reliability of evaluation 
scores, too. The Measures of Effective Teaching proj- 
ect, funded by the Bill & Melinda Gates Foundation, 
found that “if a school district is going to pay the cost 
(both in money and time) to observe two lessons for 
each teacher it gets greater reliability when each les- 
son is observed by a different person.” 13 Likewise, a 
2014 Mathematica study on Pittsburgh’s evaluation 
system found that ratings might be improved if more 
than one observer rated each teacher. This finding 
was due partly to inconsistencies in how principals 
applied the rating rubric. 14 And when systems match 
classroom teachers with observers who are experts in 
the same grade or subject, the result may be more
nuanced assessments of teachers’ practice, as well as 
more precise feedback to drive improvement — not 
only in teachers’ practice, but in principals’ evalua- 
tive and feedback skills as well. 15 




Consistent with the literature examining multi-rater 
systems, nearly every district in our review cited re- 
duced workloads for principals as a key reason for 
incorporating additional raters. Most also believed 
that adding observations and observers would produce
more, and more varied, information on which
to evaluate a teacher’s practice. And several expressed 
hope that this better data, paired with added content 
or grade-level expertise, would lead to more accurate 
evaluation scores and richer, more tailored feedback. 
Despite broadly consistent goals for their multi-rater 
approaches, the districts’ varying budgets, popula- 
tions, and local and state contexts yield a wide range 
of designs. In general, the systems vary along three 
key dimensions: teacher type (which teachers must 
be scored by multiple raters); rater type (who can 
serve as raters); and rater role (how raters share re- 
sponsibilities for observation). 


What Teachers Get Rated? 

Districts that use multiple raters must decide which 
teachers or subgroups of teachers will be observed 
by multiple raters. This decision is based on two fac- 
tors: what the district hopes to accomplish by using 
multiple raters and what resources it has to sup- 
port the design. Districts primarily concerned with 
improving the precision and validity of scoring in 
high-stakes cases might require multiple raters only 
for those teachers on track to receive the highest or 
lowest scores — teachers who may be headed for rec- 
ognition or sanction. This is the policy, for instance, 
in New Haven, CT. Requiring that all teachers be 
observed by multiple raters is a more resource-inten- 
sive approach, because it generally requires training 
and compensating additional raters and more careful 
coordination than a targeted design. Nonetheless, 
five of the 16 districts — DeSoto Parish, LA; 16 Eagle
County, CO; Santa Fe, NM; Maricopa County,
AZ; 17 and Hillsborough County, FL — have taken
this approach, many saying it is necessary to
build a common culture around evaluation and professional
improvement. However, because of resource
concerns, some of these districts plan to eventu- 
ally differentiate how often teachers are observed. 
The District of Columbia Public Schools (DCPS), 
whose IMPACT evaluation system initially required 
that all teachers be observed by a combination of 
master educators and school administrators, has al- 
ready taken this step. Teachers who have earned five 
consecutive years of “highly effective” or “effective” 
ratings — expert teachers — are now observed only 
by administrators. This arrangement allows the dis- 
trict to vary the intensity of observation and support 
teachers receive. 18 

As with New Haven, five other districts — Transylvania
County, Greenville County, Greene County,
Baltimore City, and New York City — use multiple 
raters only for novices or teachers who have earned 
very low performance ratings in the past. And in 
Greenville County, in accordance with state law, 
both probationary teachers and teachers at risk for 
termination must be observed by teams of evalua- 
tors. These strategies direct the benefits of multiple 
raters to those teachers most in need of supervision 
and support. 

Four districts allow schools to use multiple raters but 
do not require they be used generally or for any par- 
ticular subgroup of teachers. Most of these districts 
have determined that requiring the practice would 
be unfeasible due to logistical challenges, particularly 
at schools with just one administrator or with administrators
who oversee multiple sites. 19 Officials in
each of these four districts believe some schools have 
adopted multi-rater approaches, but none could say 
for certain. Jana Burk, executive director of Tulsa’s 
Teacher/Leader Effectiveness Initiative, says that “it’s 
only when really high-caliber principals are in charge 
[that they] find time to make it happen.” 

Who Does the Rating? 

Another design consideration for districts is the ques- 
tion of who should serve as raters. Some districts use 
only administrators, both district- and school-based; 
other districts use administrators and other expert 
raters, such as master teachers and mentor teachers. 
Others use administrators and teachers’ peers. 20

More than half of the districts rely solely on administrators
to serve as raters. For several, this approach
is the only option since state requirements or local
bargaining agreements restrict high-stakes observa- 
tions to people at this level. Although teachers in 
these districts are usually observed primarily by their 
principals or assistant principals, some districts also
tap central office staffers or others with administrative
credentials. This arrangement is often the case
in districts where schools may have only one on-site 
administrator. 

Six districts use distinguished teachers — called mas- 
ter teachers, mentor teachers, peers, or validators — to 
rate teachers and (in some cases) provide feed- 
back, supplementing administrators’ observations. 
Districts varied in their approaches to recruiting 
these expert raters. While some, such as DeSoto 
Parish, Eagle County, and Hillsborough County, 
hire most or all of their teacher-leaders from within 
the district, others prefer to recruit from the outside. 
New Haven Public Schools, for example, employs 
a cadre of evaluators called “third-party validators,” 
most of whom have worked in various capacities in 
other nearby districts, but none of whom are teach- 
ers or administrators in the New Haven district. This 
approach ensures that the third-party validators are 
objective raters (they do not provide feedback), an
important quality given that their primary role is to 
validate principals’ scores for the highest and lowest 
performers. DC recruits both locally and nationally 
for its master educators, a strategy that officials say 
allows the district to hire the best talent. 


While relying on expert raters to alleviate admin- 
istrators’ workloads, districts also depend on them 
to provide content and grade-level expertise that 
administrators may lack. Lori Renfro, who directs 
the new teacher evaluation system used by several 
districts in Maricopa County, knew that with higher 
expectations for feedback, “principals would need 
help with content.” 21 With this in mind, she and 
her team filled 41 peer-evaluator spots with educa- 
tors who had deep expertise in specific grades and 
content areas. These peers ensure not only that all 
teachers — even music and physical education teach- 
ers — receive ratings and feedback from relevant 
experts, but also that principals have the resources to 
improve their own observation and feedback skills. 

Only two districts, Greenville County and 
Transylvania County, include actual peers — full-time
classroom teachers — among raters of colleagues’
practice. State policies in both North Carolina and 
South Carolina require that peers help evaluate pro- 
visional early career teachers. Both states require 
that peer observers undergo district-provided train- 
ing, but they do not hold special contracts or take 
on other leadership duties as is common for master 
teachers elsewhere. As do many districts with more 
formal master-teaching positions, both Greenville 
County and Transylvania County aim whenever 
possible to pair teachers with peer evaluators who 
teach in the same grade or content areas. 

How Are Raters Deployed? 

Even when districts have similar policies about who 
can rate teachers, they often deploy raters differently. 
In several of the districts where rating responsibilities 
are restricted to administrators, building-level teams 
are free to divide and share the observation duties as 
they see fit. Several districts — including Transylvania 
County and New York City, the report’s smallest and 
largest districts, respectively — provide this flexibility. 

A handful of other districts provide more specific 
guidance on administrators’ observational roles. In 
Santa Fe, for example, all teachers are now observed 
by their own principals or assistant principals and 
by an administrator from another school within 
the district; teachers rated minimally effective are 
also observed by the assistant superintendent. Almi 
Abeyta, the district’s chief academic officer, reports 
that the new model, in just its second year, has im- 
proved calibration between principals and assistant 
principals. And when there have been questions 
about a teacher’s performance, it has often con- 
firmed what the school-based administrator had 
initially observed. 22 

Similarly, in Boston Public Schools, administrators
serve as primary evaluators for some teachers and 
secondary evaluators for others. Teachers are as- 
signed a primary evaluator, but they might also have 
multiple administrators serving as secondary evalu- 
ators so that school leadership teams can establish 
trends in teachers’ performance over time. While 
both types can conduct observations and submit 
evidence to teachers’ online portfolios, raters score 
evidence and provide a final performance rating only 
for those teachers in their primary caseload. Angela 
Rubenstein, a district implementation specialist, 
says this division of labor “distributes workload so 
caseloads are more manageable and so teachers get a 
lot of opportunities to generate data for their evalu- 
ation.” It also ensures that evaluators “have enough 
interaction with teachers to provide meaningful 
feedback for development,” she says. 23 




Greene County, a large rural school system, was 
the only district in which administrators frequently 
observed classrooms in tandem. In a pilot project 
supported by the state’s federal Race to the Top 
grant, principals and assistant principals co-observe 
with counterparts from similar schools within the 
district. They rate lessons separately, then meet to 
reach a consensus on scores and feedback. Though
currently limited to the district’s lowest-performing
teachers, the pilot has provided three key
benefits, Greene County officials say: greater objectivity
in scoring, broader expertise in providing
feedback, and professional development for princi- 
pals, who, according to Superintendent Vicki Kirk, 
have “really, really enjoyed breaking down [teachers’] 
lessons and norming their interpretations of the ru- 
bric with colleagues.” 24 

Co-observation also occurred in districts in which 
master, mentor, or peer teachers serve as expert peer 
raters, although the practice is required only in New 
Haven and New York City, where some teachers are 
observed by both their principals and external vali- 
dators. In New York City, the policy applies only 
to the lowest-performing teachers; New Haven uses
validators to co-observe both its lowest- and highest- 
scoring teachers. 

Most other districts using expert peer raters sim- 
ply divide observation responsibilities among 
combinations of raters, with master teachers often 
shouldering half or more of the observation and 
feedback requirements for each teacher. 

The workloads of the expert raters depend largely 
on districts’ needs and the nature of their contracts. 
In DeSoto Parish, DC, Hillsborough County, and 
Maricopa County, master teachers (DeSoto and 
DC) or peers (Hillsborough and Maricopa) devote 
all of their time to observing, rating, coaching, and 
leading professional development sessions for teach- 
ers and sometimes principals. In Eagle County, on 
the other hand, master teachers maintain about 30 
percent of their teaching duties, devoting the other 
70 percent to evaluation and staff development, 
with each carrying a caseload of about 20 teachers.
Eagle’s 20 master teachers work under 201-day contracts
— 20 days longer than the contracts held by
the district’s classroom teachers. 25 

Both Eagle County and DeSoto Parish also em- 
ploy mentor teachers who perform many of the 
same duties as masters (doing observations, provid- 
ing feedback, etc.) but carry heavier teaching loads 
and devote less time to observation and feedback 
responsibilities. 

Transylvania County, one of the two districts that 
include peer raters in their system, also distributes 
responsibilities across raters, but not quite so evenly. 
Administrators in Transylvania conduct three obser- 
vations for each teacher up for promotion to career 
contract status; peers contribute just one. 

Challenges to Using Multi-Rater Systems 

In making decisions about how to design multi-rater 
systems, districts had to consider their goals for eval- 
uation, the capacity of their existing systems, and 
the resources available to improve or expand them. 
Many districts faced tradeoffs: adding raters may 
bring much-needed expertise and reduced workloads 
for principals, but such improvements were often 
costly and frequently introduced new challenges. 


Rater Reliability

An especially big problem for districts is how to
train observers to properly rate classroom practice.
Training is important both for the technical challenge
of making sure that ratings are consistent,
common, precise, and reliable and for the significance
that accurate ratings hold for an entire school
system.

It is essential, first, that each rater judge every teacher
by the same standard — that a score of 5, for instance,
represents the same caliber of instruction each time
it is applied. At the same time, when several raters
are judging, they must all agree on what each of the
rating levels looks like. To make sure their ratings are
reliable in this way, they must meet several times a
year to recalibrate their scoring. Ensuring this kind
of quality is expensive, often more costly than the
initial training.

At DCPS, Stephanie Aberger, the director of
IMPACT’s training platform, says that robust introductory
training includes about 10 hours of
preliminary work online and another 10 hours of
in-person support. “This is a significant time commitment,”
Aberger says, “especially for a busy new
school leader.” The work is funded with a $2.2 million
grant from the Bill & Melinda Gates
Foundation.

Yet thorough and repeated training is essential to the
process of accurately identifying good teaching and
to building a shared understanding of performance.
Just as previous “drive-by” teacher evaluations were
essentially meaningless, and signaled as much to the
teaching profession, the new generation of evaluations
can help build a shared sense of what good
teaching looks like in a district — or breed mistrust
when scores for comparable teachers are radically
different. Teachers lose faith in the system when
scores are obviously out of sync. Administrators,
when guided by inaccurate ratings, can make poor
judgments, sometimes in high-stakes cases. Training
raters effectively, and recalibrating them periodically
to ensure the quality of ratings, is an important topic
that Carnegie will address in an upcoming brief.

Financial Cost 

The cost of multi-rater systems depends on the spe- 
cifics of their design. Hillsborough County, a district 
that increased both the frequency of observation and 
the number of eligible raters, spent $11.9 million on 
its evaluation system in 2011-2012. 26 More than 85
percent of that total — nearly $10.4 million — went 
to support its multi-rater observation model, accord- 
ing to a 2013 report from the RAND Corporation 
and the American Institutes for Research. 


Costs associated with the four key components
of Hillsborough’s observation model:

Design and implementation: $525,580
    Includes cost of training and calibrating raters

Peer and mentor observers: $8,122,558
    Includes salaries and benefits for 189 peer and mentor
    teachers and the teachers hired to take over their
    full-time teaching loads

Management and communications: $316,740
    Includes salaries for central office staff managing the
    system and fees to contractors for communications and
    management support

Technology and data systems: $1,432,988
    Includes an online platform to house and aggregate
    data on teachers’ performance and laptops for peer and
    mentor observers

Total costs associated with Hillsborough
observation in 2011-2012: $10,396,865 27
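
As a rough check on the figures above, the arithmetic can be sketched in a few lines of Python. The dollar amounts are copied from the cost breakdown and the surrounding text; the variable names are illustrative only, and the $11.9 million system-wide figure is rounded in the text, so the resulting share is approximate. As printed here, the four line items sum to about $1,000 more than the stated total, a gap this brief does not explain.

```python
# Line items as printed in the Hillsborough cost breakdown (2011-2012).
components = {
    "Design and implementation": 525_580,
    "Peer and mentor observers": 8_122_558,
    "Management and communications": 316_740,
    "Technology and data systems": 1_432_988,
}

reported_total = 10_396_865           # total as stated in the breakdown
evaluation_system_total = 11_900_000  # "$11.9 million", rounded in the text

# Sum of the itemized costs and its difference from the stated total.
component_sum = sum(components.values())
gap = component_sum - reported_total

# Share of overall evaluation spending that went to observation.
share = reported_total / evaluation_system_total

# Largest single line item and its share of the itemized costs.
largest_item = max(components, key=components.get)
largest_share = components[largest_item] / component_sum

print(f"Sum of line items:        ${component_sum:,}")
print(f"Stated total:             ${reported_total:,} (gap: ${gap:,})")
print(f"Share of evaluation cost: {share:.1%}")
print(f"Largest item:             {largest_item} ({largest_share:.1%})")
```

Run as written, the sketch confirms the "more than 85 percent" figure and shows that peer and mentor observers account for more than three-quarters of the itemized observation costs.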


Although this sum may be a relative drop in the 
bucket of Hillsborough’s total annual operating bud- 
get of $2.3 billion, it’s worth noting that nearly two 
thirds of the $24.8 million Hillsborough spent on 
its evaluation system from 2009 to 2012 came from 
external grant funding. 28 

Hillsborough is not alone in this regard.
Eagle County, DC, Maricopa County, and others re- 
ceived significant external support from state Race 
to the Top awards, federal Teacher Incentive Fund 
grants, and private foundations to design, launch, 
and initially sustain their intensive and costly multi- 
rater systems. Most of the remaining districts relied 
primarily on local and state funding to get their less- 
intensive systems up and running. 

Perhaps because of these funding differences, most 
districts pursued less resource-intensive designs than 
did Hillsborough or DC — particularly models that 
use multiple raters only for the lowest-performing teachers
or those that recommend rather than require
multiple raters. Only one district — Santa Fe — 
implemented a multi-rater requirement for all 
teachers without increasing the number of raters 
in the district. The district will employ two retired 
principals in 2014-2015 to help cover increased 
observation duties when it moves to a more differ- 
entiated system. 29 But several districts — Santa Fe 
included — said they would reconsider elements of 
their designs if they had more money to train and 
compensate more raters. 

Staffing Considerations 

Even if resources exist to employ additional raters, 
many districts have little opportunity to do so, since 
state policies or local contractual language sometimes 
restricts evaluation tasks to licensed administrators. 
In Chicago, an effort to allow department chairs
to serve as evaluators was rejected by the Chicago 
Teachers Union which, like unions elsewhere, raised 
concerns about allowing teachers from the same col- 
lective bargaining unit to evaluate one another. 30 
DC, which faces similar contractual restrictions, was 
able to use its master educators as raters by including 
them in the local administrators’ union. 


Districts with greater flexibility still had to weigh the
pros and cons of engaging non-administrative raters
in their evaluation processes. Though they can
expand the system’s bandwidth and expertise, non-administrative
raters — whether they are external
raters, master teachers, or peers — are not always welcomed
by principals and teachers, who may doubt
their motives and value. When Maricopa County
introduced expert peer evaluators, teachers and
principals experienced an “initial shock,” says Lori
Renfro; they did not immediately trust the peers or
their motives in the evaluation process. 31 To foster
more personal relationships between educators and
the peer evaluators, many Maricopa County peer
evaluators now meet with principals and teachers
before the evaluation cycles begin in informal, one-on-one
meetings and school-level “meet and greets.”
And, to ensure that peers aren’t spread too thinly
across the district, Maricopa has reconfigured their
assignments, giving them more time to spend on
fewer campuses. Maricopa officials hope the move
will build stronger relationships between the roving
peers and school-based educators.

Though Maricopa, DC, New Haven, and other districts
hire non-administrative evaluators primarily
from outside the district (an expensive process in
itself), many other districts hire master or mentor
teachers from within — both to thwart trust issues
associated with “external” raters and to provide
meaningful leadership opportunities for their own
high-performing teachers. But hiring raters from
within may also introduce complex interpersonal
dynamics, as teacher leaders adjust to their new
authority and redefine relationships with former
colleagues. 32


Logistics and Coherence 

Adding new raters — whether newly certified assistant 
principals, external raters, or master teachers — re- 
quires more careful coordination, both between the 
district and schools and within schools themselves. 
Systems must be in place to schedule and monitor 
classroom visits, to collect and store observation 
data from multiple raters, and, most important, 
to provide time for raters to calibrate their scoring 
practices and ensure they are providing consistent, 
coherent feedback. 

Implementing such systems, however, is often easier 
said than done. Building and maintaining databases 
to house observation data can be time-consuming, 
as well as costly, as evidenced by Hillsborough’s budget
shown above. And it can be difficult in a busy school
week to carve out time for raters to discuss their 
scores and align feedback with each other. 

The districts cited these challenges almost unanimously,
but several have found ways to mitigate their
effects. They now rely on a range of online applica- 
tions to aggregate and store evaluation data from 
various raters. Many districts allow teachers to access 
these systems and upload artifacts of their teaching 
practice, making the process more transparent and 
interactive than ever before. 

Several districts have also hired full-time staff to 
help manage their evaluation processes. Along with 
Patty Fox, who coordinates Greenville County’s 
evaluation system, the district also employs seven 
full-time lead teachers who, in addition to serving 
as raters on the district’s evaluation teams, schedule 
visits for their teams, collect and input data from 
team members, and generate official reports. Based 
in the district’s central office, these lead teachers 
handle caseloads of about 50 teachers at a time and, 
according to Fox, have been “absolutely huge” in 
making the process manageable for school leaders. 
Similar positions existed in several of the districts, 
but smaller districts employ fewer full-time staff 
in such roles and seemed to rely more heavily on 
principals to coordinate the process. 

With multiple raters, it is important for districts to 
ensure that raters are carefully calibrated, particular- 
ly as some districts found that non-principal raters 
produced lower scores, on average, than principals 
(a finding consistent with several recent studies). 
In two districts in which this difference was most 
pronounced — both districts in which expert peers 
serve in full-time evaluation roles — officials attrib- 
uted the pattern largely to the fact that expert peer 
raters were hand-selected for their roles and de- 
ployed exclusively to evaluate and provide feedback 


(and to perform accompanying reporting duties). 
Many administrators, on the other hand, were not 
hired for their ability to rate teachers’ practice; some 
were even reluctant to do so, either because they 
lacked confidence in their evaluative skills or sim- 
ply because they were skeptical of the process itself. 
And all had other administrative duties occupying 
their time. 
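
One way a district might check for the kind of scoring gap described above is to compare average observation scores by rater type. The sketch below is illustrative only: the scores, the 1-to-4 rubric scale, the rater labels, and the function name are hypothetical and are not drawn from any district's data.

```python
from statistics import mean


def mean_score_gap(scores_by_rater_type, baseline="principal"):
    """Return each rater type's mean score minus the baseline type's mean.

    A negative gap means that rater type scores lower, on average,
    than the baseline rater type.
    """
    baseline_mean = mean(scores_by_rater_type[baseline])
    return {
        rater_type: mean(scores) - baseline_mean
        for rater_type, scores in scores_by_rater_type.items()
        if rater_type != baseline
    }


# Hypothetical observation scores on a 1-to-4 rubric scale (made-up numbers).
example_scores = {
    "principal": [3.0, 3.5, 3.0, 4.0],
    "expert_peer": [2.5, 3.0, 3.0, 3.5],
}

gaps = mean_score_gap(example_scores)
print(gaps)
```

A gap of this kind does not show which group is scoring more accurately; it only flags that calibration sessions may be worth scheduling.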

A small number of districts (including one of those 
with large gaps in administrator and expert peers’ 
scores) also noted that expert raters undergo more 
intensive training than principals, in the hope that 
they will coach and assist the principals where neces- 
sary. Though this strategy may conserve resources, 
it also raises concerns about inadequate and incon- 
sistent training. Raters who do not receive proper 
training and support may introduce issues of reli- 
ability into evaluation scores, compromising the 
quality of data used to make judgments about teach- 
ers’ practice, even threatening their employment. 
Such issues could plague any evaluation system, but 
multi-rater systems that intentionally apply different 
training standards to raters may be particularly vul- 
nerable. In practice, the strategy also received mixed 
reviews: though some districts using this approach 
found it beneficial for improving administrators’ 
abilities and the collegiality between the raters, oth- 
ers noted that it created tension, as when principals 
felt they were being “policed” unfairly by other,
more highly trained raters.

A small number of districts mentioned deliberate ef- 
forts to track and coordinate the feedback teachers 
receive from various raters, but few had clear, estab- 
lished systems to ensure that feedback was consistent 
and coherent. DeSoto Parish, a standout in this area, 
has worked hard to define raters’ various roles and
enlists master teachers to coordinate the information 
and support teachers receive from principals, men- 
tor teachers, and one another. Kathy Noel, DeSoto 
Parish’s director of student learning, says the master 
teachers are “the maestros of the whole [evaluation 
and support] system.” 33




In some cases, districts’ attempts to better coordinate 
feedback may be hindered by local policy. In DC, 
for example, an information “firewall” prevents the 
master educators and the district’s fleet of instruc- 
tional coaches from freely sharing any details about 
their interactions with specific teachers. Put in place 
to allay concerns from the local union (such as mem- 
bers of the administrative union sharing evaluation 
data with non-administrative coaches), the firewall 
makes it challenging to ensure that teachers receive 
consistent feedback from the various instructional 
leaders. This is the case even though the district has 
taken steps to assess and improve coherence among 
various feedback providers. 

A handful of other districts have sought to coordi- 
nate feedback by allowing various raters to exchange 
information through an online evaluation portal. It 


is unclear, however, how often raters in each district 
use this function, or whether other feedback provid- 
ers (such as coaches) can access these portals or input 
their own information. Without that functionality, 
or other efforts to coordinate between the various 
providers, teachers are likely to be overwhelmed 
by the amount and range of support they receive. 
And this outcome may have the unintended ef- 
fect of undermining systems’ aims for instructional 
improvement. 

Conclusion 

Though it is too early to determine the full impact 
of multi-rater evaluation systems, it is clear that 
districts have high hopes for their ability to gather 
better information, produce more accurate, objec- 
tive ratings, and improve the quality of feedback 
teachers receive, all while reducing principals’ workloads.
Emerging research suggests that many of these
expectations are well warranted.

Districts’ unique aspirations for their systems, along 
with their policy contexts, led to designs that varied 
along three main dimensions: the type of teachers 
evaluated by multiple raters, the type of raters used,
and how raters are deployed. Districts also faced 
certain common challenges in implementing their 
designs. Chief among these were issues related to 
cost, raters’ capacity, and system coherence. 

Perhaps as a result of these challenges, multi-rater 
systems are not yet commonplace in the nation’s 
schools. 34 Only four states — North Carolina, South
Carolina, New Jersey, and Maryland — require 
that teachers be observed by multiple raters, but 
these requirements apply only to novices or the
lowest-performing teachers. In one survey of middle
school math teachers in Missouri, fewer than 23
percent reported having been rated by more than 
one observer. 35 But, given a recent assertion by the 
Brookings Institution that “nearly all the opportuni- 
ties for improvement to teacher evaluation systems 
are in the area of classroom observations rather than 
test score gains,” it seems likely that policymakers 
and district officials will soon double down on ef- 
forts to strengthen observation processes. 36 And 
when they do, multi-rater systems stand to become 
a prevalent strategy for improvement. 


Taylor White is a former associate for public policy engage- 
ment at the Carnegie Foundation for the Advancement of 
Teaching. She now serves as the deputy director of education 
policy and university research at the Australian Embassy in 
Washington, DC. 
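
Districts: Transylvania County, NC; DeSoto Parish, LA; Eagle County, CO; Greene County, TN; Santa Fe, NM; Maricopa County, AZ (REIL Only) (2); New Haven, CT

DISTRICT DATA (1)

Students
Transylvania County, NC: 3,557
DeSoto Parish, LA: 5,040
Eagle County, CO: 6,344
Greene County, TN: 7,467
Santa Fe, NM: 14,071
Maricopa County, AZ (REIL Only): 48,347
New Haven, CT: 20,554

Schools
Transylvania County, NC: 9
DeSoto Parish, LA: 13
Eagle County, CO: 21
Greene County, TN: 18
Santa Fe, NM: 30
Maricopa County, AZ (REIL Only): 100
New Haven, CT: 49

Teachers
Transylvania County, NC: 261
DeSoto Parish, LA: 379
Eagle County, CO: 429
Greene County, TN: 470
Santa Fe, NM: 949
Maricopa County, AZ (REIL Only): 2,420
New Haven, CT: 1,615

EVALUATION SYSTEM INFORMATION

System Name
Transylvania County, NC: NC Educator Effectiveness System
DeSoto Parish, LA: TAP
Eagle County, CO: Professional Excellence, Appraisal, and Recognition (PEAR)
Greene County, TN: Tennessee Educator Acceleration Model (TEAM)
Santa Fe, NM: NM TEACH
Maricopa County, AZ (REIL Only): REIL
New Haven, CT: TEVAL

Implementation Date
Transylvania County, NC: 2011-2012
DeSoto Parish, LA: 2008-2009
Eagle County, CO: 2009-2010
Greene County, TN: 2011-2012
Santa Fe, NM: 2013-2014
Maricopa County, AZ (REIL Only): 2010-2011
New Haven, CT: 2009-2010

Components of Rating
Transylvania County, NC: Teachers scored by principal on 5 domains of practice. Sixth component is based on student achievement.
DeSoto Parish, LA: Composite of Skills, Knowledge, and Responsibility Score (converted from TAP rubric scale to LA state system's scale) and Student Growth Measure
Eagle County, CO: 50% Measures of Professional Practice + 50% Measures of Student Learning
Greene County, TN: 35% Student Growth + 15% Academic Achievement + 50% Observation
Santa Fe, NM: 25% Observation + 25% Multiple Measures + 50% Student Achievement
Maricopa County, AZ (REIL Only): 50% Observation + 40% Individual Growth + 5% Team Growth + 5% School Growth
New Haven, CT: Matrix: Instructional Practice and Professional Value Score + Student Learning Growth Score

Effectiveness Levels
Transylvania County, NC: 4
DeSoto Parish, LA: 5
Eagle County, CO: 5
Greene County, TN: 5
Santa Fe, NM: 5
Maricopa County, AZ (REIL Only): 4
New Haven, CT: 5

Minimum number of required, summative observations (annually)
Transylvania County, NC: 3 (career status)/4 (probationary)
DeSoto Parish, LA: 3
Eagle County, CO: 3
Greene County, TN: 4 (>3 years experience)/2 (3+ years experience)/1 (teachers with past "5" ratings)
Santa Fe, NM: 3
Maricopa County, AZ (REIL Only): 4
New Haven, CT: 3

RATERS POLICIES

What is the state policy on multiple raters?
Transylvania County, NC: Probationary teachers undergo 3 observations from principal and one from a peer. Career teachers undergo 3 observations from principal.
DeSoto Parish, LA: No requirement
Eagle County, CO: No requirement
Greene County, TN: No requirement
Santa Fe, NM: No requirement
Maricopa County, AZ (REIL Only): No requirement
New Haven, CT: No requirement

Who, in addition to administrators, can contribute observation scores to teachers' summative evaluations?
Transylvania County, NC: Actual Peers (3)
DeSoto Parish, LA: Master Teachers
Eagle County, CO: Master Teachers
Greene County, TN: Administrators only
Santa Fe, NM: Administrators
Maricopa County, AZ (REIL Only): Expert Peers
New Haven, CT: Third Party Evaluators

Which teachers are required to be observed by multiple raters?
Transylvania County, NC: Probationary Teachers/Contract Teachers
DeSoto Parish, LA: Yes
Eagle County, CO: All
Greene County, TN: Teachers who scored a "1" on previous evaluation.
Santa Fe, NM: All
Maricopa County, AZ (REIL Only): All
New Haven, CT: Highest and Lowest Performing Teachers

May other teachers be observed by multiple raters?
Transylvania County, NC: Yes
DeSoto Parish, LA: Yes
Eagle County, CO: Yes
Greene County, TN: Yes
Santa Fe, NM: Yes
Maricopa County, AZ (REIL Only): Yes
New Haven, CT: Yes

How do various raters contribute to teachers' summative score(s)?
Transylvania County, NC: For probationary teachers, peers contribute 1 of the 4 required observation scores. For Contract status teachers, administrators may divide the required observations.
DeSoto Parish, LA: Frequent scored and unscored observation
Eagle County, CO: Mentor teachers' ratings account for 35% of teachers' professional practice score. Principals' scores make up the remaining 65%. Mentor teachers also conduct a required observation for each teacher, but these are not counted in final summative evaluation scores.
Greene County, TN: Administrators co-observe low-performers. Observation "teams" consist of a teacher's own principal/AP and a principal/AP from another school in the district. The home principal issues the final summative score.
Santa Fe, NM: Third required observation must be conducted by administrator who did not conduct first two observations.
Maricopa County, AZ (REIL Only): Frequent scored observation.
New Haven, CT: Administrator and TPV co-observe 3 lessons. TPVs' scores are kept separate and reviewed only if evaluation scores become consequential.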


(1) All district data from National Center for Education Statistics, based on CCD Public school district data for the 2011-2012, 2012-2013 school years. 

(2) The REIL project involves twelve districts in Maricopa County in greater Phoenix, AZ. This report examines only policies in participating REIL districts. 

(3) The report uses the term "actual peer" to refer to peer observers who occasionally observe, rate, and provide feedback to other teachers, but do not spend the bulk of their time in this capacity. The report uses the 
term "expert peer" to refer to the master teachers, master educators, and third party evaluators employed as full- or nearly full-time in an evaluative capacity. 
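
Districts: Tulsa, OK; Washington, DC; Boston, MA; Greenville, SC; Baltimore City, MD; Gwinnett County, GA; Hillsborough County, FL; Chicago, IL; New York, NY

DISTRICT DATA (1)

Students
Tulsa, OK: 41,199
Washington, DC: 44,618
Boston, MA: 55,027
Greenville, SC: 72,153
Baltimore City, MD: 84,212
Gwinnett County, GA: 132,370
Hillsborough County, FL: 194,041
Chicago, IL: 403,004
New York, NY: 968,143

Schools
Tulsa, OK: 94
Washington, DC: 134
Boston, MA: 135
Greenville, SC: 96
Baltimore City, MD: 195
Gwinnett County, GA: 134
Hillsborough County, FL: 326
Chicago, IL: 647
New York, NY: 1,523

Teachers
Tulsa, OK: 2,457
Washington, DC: 3,472
Boston, MA: 4,261
Greenville, SC: 4,376
Baltimore City, MD: 5,532
Gwinnett County, GA: 10,323
Hillsborough County, FL: 13,862
Chicago, IL: 22,460
New York, NY: 62,368

EVALUATION SYSTEM INFORMATION

System Name
Tulsa, OK: TULSA Model
Washington, DC: IMPACT
Boston, MA: Educator Effectiveness System
Greenville, SC: Performance Appraisal System for Teachers
Baltimore City, MD: Teacher Effectiveness Evaluation
Gwinnett County, GA: Teacher Effectiveness System
Hillsborough County, FL: Empowering Effective Educators
Chicago, IL: Recognizing Educators Advancing Chicago's Students (REACH)
New York, NY: Advance

Implementation Date
Tulsa, OK: 2011-2012
Washington, DC: 2009-2010
Boston, MA: 2012-2013
Greenville, SC: 2006-2007
Baltimore City, MD: 2013-2014
Gwinnett County, GA: 2013-2014
Hillsborough County, FL: 2010-2011
Chicago, IL: 2012-2013
New York, NY: 2013-2014

Components of Rating
Tulsa, OK: 100% Professional Practice (observations, student surveys, etc.)
Washington, DC: 40-75% Teaching and Learning + 15-50% Student Achievement Data + 10% Commitment to School Community (4)
Boston, MA: Matrix: Instructional Practice and Professional Values Score & Student Learning Growth Score
Greenville, SC: 100% Professional Practice (observations, student surveys, etc.)
Baltimore City, MD: 85% Observations + 15% Professional Expectations Measure
Gwinnett County, GA: 50% Observation & Survey Scores + 50% Student Growth & Academic Achievement
Hillsborough County, FL: 35% Principal Appraisal + 25% Peer/Mentor Appraisal + 40% Student Achievement Gains
Chicago, IL: 0-25% Student Growth + 75-100% Teacher Practice (varies by grade-level, subject)
New York, NY: 40% Measures of Student Learning + 60% Measures of Teacher Practice

Effectiveness Levels
Tulsa, OK: 5
Washington, DC: 5
Boston, MA: 4 (5)
Greenville, SC: 4
Baltimore City, MD: 4
Gwinnett County, GA: 4
Hillsborough County, FL: 4
Chicago, IL: 4
New York, NY: 4

Minimum number of required, summative observations (annually)
Tulsa, OK: 2 (contract teachers)/4 (probationary teachers)
Washington, DC: 1 to 5 (varies based on prior ratings)
Boston, MA: 1 to 5 (varies based on prior ratings)
Greenville, SC: 0 (first year teachers)/6 (second year teachers)/1 (all others)
Baltimore City, MD: 2
Gwinnett County, GA: 2 full-length + 4 short visits
Hillsborough County, FL: 3 to 11 (varies based on prior ratings)
Chicago, IL: 2 (tenured teachers)/4 (probationary teachers)
New York, NY: 4 to 6 (varies by effectiveness level and teacher preference)

RATERS POLICIES

What is the state policy on multiple raters?
Tulsa, OK: No requirement
Washington, DC: Required for all teachers except "highly effective"
Boston, MA: No requirement
Greenville, SC: Required for novice & lowest performers
Baltimore City, MD: Required for low performers
Gwinnett County, GA: No requirement
Hillsborough County, FL: No requirement
Chicago, IL: No requirement
New York, NY: No requirement

Who, in addition to administrators, can contribute observation scores to teachers' summative evaluations?
Tulsa, OK: Administrators only
Washington, DC: Master Educators
Boston, MA: Department Heads, Specialists, & Peers (6)
Greenville, SC: Actual Peers
Baltimore City, MD: Administrators only
Gwinnett County, GA: Administrators only
Hillsborough County, FL: Expert Peers, Supervisors
Chicago, IL: Administrators only (7)
New York, NY: Administrators and Validators

Which teachers are required to be observed by multiple raters?
Tulsa, OK: No requirement
Washington, DC: All teachers rated "effective" or below
Boston, MA: All
Greenville, SC: Probationary teachers eligible to move to continuing contracts
Baltimore City, MD: Teachers previously rated "ineffective"
Gwinnett County, GA: No requirement
Hillsborough County, FL: All
Chicago, IL: No requirement
New York, NY: Teachers rated "ineffective" in previous year

May other teachers be observed by multiple raters?
Tulsa, OK: Yes
Washington, DC: Yes
Boston, MA: Yes
Greenville, SC: Yes
Baltimore City, MD: Yes
Gwinnett County, GA: Yes
Hillsborough County, FL: Yes
Chicago, IL: Yes
New York, NY: Yes

How do various raters contribute to teachers' summative score(s)?
Tulsa, OK: A single administrator must conduct mandatory observations. Other building administrators can contribute additional observation or evidence.
Washington, DC: Administrators conduct two mandatory observations and collect evidence throughout school year. Master teachers conduct two observations.
Boston, MA: Primary Evaluator issues summative rating. Secondary Evaluator conducts observations and collects evidence to be considered in primary rater's summative rating.
Greenville, SC: Observation scores
Baltimore City, MD: Observation scores
Gwinnett County, GA: Administrators share observation duties and any other duties related to evaluation, but principals must sign off on summative ratings.
Hillsborough County, FL: Observation scores
Chicago, IL: Principals and Assistant Principals share observation duties and other duties related to evaluation.
New York, NY: Validators score teachers rated "ineffective" in the previous evaluation cycle. Administrators may share other evaluation duties.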


(4) In 2014-2015, all DCPS IMPACT scores will be calculated using the 75% Teaching and Learning + 15% Student Achievement + 10% Commitment to School Community formula because the district will not be 
calculating value-added scores during the first year of PARCC implementation. Teacher-assessed student achievement data (TAS) will make up the 15% Student Achievement data component. 

(5) Evaluations in Boston include two separate components: a summative performance rating (4 possible ratings) and a student impact rating (3 possible ratings). 

(6) Only one or two campuses in Boston have established peer observation systems through special arrangements with the district. This report does not examine those cases, but acknowledges they exist. 

(7) Chicago also employs Instructional Effectiveness Specialists to support principals in the evaluation process. They regularly co-observe with principals to help calibrate scoring, but their scores do not count toward 
teachers' summative ratings. 
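
Several of the "Components of Rating" entries in the tables are weighted sums. As a rough illustration, the sketch below combines component scores using the weights listed for Hillsborough County (35% principal appraisal, 25% peer/mentor appraisal, 40% student achievement gains). The component scores, the shared 0-to-100 scale, and the function and key names are assumptions made for illustration; the report does not describe how any district scales or combines its scores in practice.

```python
def weighted_composite(scores, weights):
    """Combine component scores into one composite using percentage weights.

    `scores` and `weights` are dicts keyed by component name; the weights
    are whole-number percentages that must sum to 100.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    return sum(scores[name] * weights[name] for name in weights) / 100


# Weights as listed in the table for Hillsborough County, FL.
hillsborough_weights = {"principal": 35, "peer_mentor": 25, "student_gains": 40}

# Hypothetical component scores on an assumed common 0-to-100 scale.
example_scores = {"principal": 80, "peer_mentor": 70, "student_gains": 60}

composite = weighted_composite(example_scores, hillsborough_weights)
print(composite)
```

Swapping in another district's listed weights (for example, 50/50 for Eagle County) only requires changing the weights dictionary, which is why the weights are kept separate from the scores here.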